{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Collaborative filtering\n",
    "-----\n",
    "\n",
    "In this example, we'll build a quick explicit feedback recommender system: that is, a model that takes into account explicit feedback signals (like ratings) to recommend new content.\n",
    "\n",
    "We'll use an approach first made popular by the [Netflix prize](http://www.netflixprize.com/) contest: [matrix factorization](https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf). \n",
    "\n",
    "The basic idea is very simple:\n",
    "\n",
    "1. Start with user-item-rating triplets, conveying the information that user _i_ gave some item _j_ rating _r_.\n",
    "2. Represent both users and items as high-dimensional vectors of numbers. For example, a user could be represented by `[0.3, -1.2, 0.5]` and an item by `[1.0, -0.3, -0.6]`.\n",
    "3. The representations should be chosen so that, when we multiplied together (via [dot products](https://en.wikipedia.org/wiki/Dot_product)), we can recover the original ratings.\n",
    "4. The utility of the model then is derived from the fact that if we multiply the user vector of a user with the item vector of some item they _have not_ rated, we hope to obtain a predicition for the rating they would have given to it had they seen it.\n",
    "\n",
    "![collaborative filtering](matrix_factorization.png)\n",
    "source:[ampcamp.berkeley](http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Preparations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import os.path as op\n",
    "\n",
    "from zipfile import ZipFile\n",
    "try:\n",
    "    from urllib.request import urlretrieve\n",
    "except ImportError:  # Python 2 compat\n",
    "    from urllib import urlretrieve\n",
    "\n",
    "# this line need to be changed:\n",
    "data_folder = '/home/lelarge/data/'\n",
    "\n",
    "ML_100K_URL = \"http://files.grouplens.org/datasets/movielens/ml-100k.zip\"\n",
    "ML_100K_FILENAME = op.join(data_folder,ML_100K_URL.rsplit('/', 1)[1])\n",
    "ML_100K_FOLDER = op.join(data_folder,'ml-100k')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We start with importing a famous dataset, the [Movielens 100k dataset](https://grouplens.org/datasets/movielens/100k/). It contains 100,000 ratings (between 1 and 5) given to 1683 movies by 944 users:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "if not op.exists(ML_100K_FILENAME):\n",
    "    print('Downloading %s to %s...' % (ML_100K_URL, ML_100K_FILENAME))\n",
    "    urlretrieve(ML_100K_URL, ML_100K_FILENAME)\n",
    "\n",
    "if not op.exists(ML_100K_FOLDER):\n",
    "    print('Extracting %s to %s...' % (ML_100K_FILENAME, ML_100K_FOLDER))\n",
    "    ZipFile(ML_100K_FILENAME).extractall(data_folder)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Other datasets, see: [Movielens](https://grouplens.org/datasets/movielens/)\n",
    "\n",
    "Possible soft to benchmark: [Lenskit](http://lenskit.org/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Data analysis and formating"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Python Data Analysis Library](http://pandas.pydata.org/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "all_ratings = pd.read_csv(op.join(ML_100K_FOLDER, 'u.data'), sep='\\t',\n",
    "                          names=[\"user_id\", \"item_id\", \"ratings\", \"timestamp\"])\n",
    "all_ratings.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check out a few macro-stats of our dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#number of entries\n",
    "len(all_ratings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_ratings['ratings'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# number of unique rating values\n",
    "len(all_ratings['ratings'].unique())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_ratings['user_id'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# number of unique users\n",
    "total_user_id = len(all_ratings['user_id'].unique())\n",
    "print(total_user_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_ratings['item_id'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# number of unique rated items\n",
    "total_item_id = len(all_ratings['item_id'].unique())\n",
    "print(total_item_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For spliting the data into _train_ and _test_ we'll be using a pre-defined function from [scikit-learn](http://scikit-learn.org/stable/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "ratings_train, ratings_test = train_test_split(\n",
    "    all_ratings, test_size=0.2, random_state=42)\n",
    "\n",
    "user_id_train = ratings_train['user_id']\n",
    "item_id_train = ratings_train['item_id']\n",
    "rating_train = ratings_train['ratings']\n",
    "\n",
    "user_id_test = ratings_test['user_id']\n",
    "item_id_test = ratings_test['item_id']\n",
    "rating_test = ratings_test['ratings']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(user_id_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(user_id_train.unique())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(item_id_train.unique())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that all the movies are not rated in the train set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "user_id_train.iloc[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "item_id_train.iloc[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rating_train.iloc[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. The model\n",
    "\n",
    "We can feed our dataset to the `FactorizationModel` class - a sklearn-like object that allows us to train and evaluate the explicit factorization models.\n",
    "\n",
    "Internally, the model uses the `Model_dot`(class to represents users and items. It's composed of a 4 `embedding` layers:\n",
    "\n",
    "- a `(num_users x latent_dim)` embedding layer to represent users,\n",
    "- a `(num_items x latent_dim)` embedding layer to represent items,\n",
    "- a `(num_users x 1)` embedding layer to represent user biases, and\n",
    "- a `(num_items x 1)` embedding layer to represent item biases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import torch.nn as nn\n",
    "import torch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's generate [Embeddings](http://pytorch.org/docs/master/nn.html#embedding) for the users, _i.e._ a fixed-sized vector describing the user"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "embedding_dim = 3\n",
    "embedding_user = nn.Embedding(total_user_id, embedding_dim)\n",
    "input = torch.LongTensor([[1,2,4,5],[4,3,2,0]])\n",
    "embedding_user(input)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make sure to check out ```torch_utils.py``` file to find the helper functions used in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import imp\n",
    "import torch_utils; imp.reload(torch_utils)\n",
    "\n",
    "from torch_utils import ScaledEmbedding, ZeroEmbedding\n",
    "\n",
    "class DotModel(nn.Module):\n",
    "    \n",
    "    def __init__(self,\n",
    "                 num_users,\n",
    "                 num_items,\n",
    "                 embedding_dim=32):\n",
    "        \n",
    "        super(DotModel, self).__init__()\n",
    "        \n",
    "        self.embedding_dim = embedding_dim\n",
    "        \n",
    "        self.user_embeddings = ScaledEmbedding(num_users, embedding_dim)\n",
    "        self.item_embeddings = ScaledEmbedding(num_items, embedding_dim)\n",
    "        self.user_biases = ZeroEmbedding(num_users, 1)\n",
    "        self.item_biases = ZeroEmbedding(num_items, 1)\n",
    "                \n",
    "        \n",
    "    def forward(self, user_ids, item_ids):\n",
    "        \n",
    "        #\n",
    "        # your code here\n",
    "        #\n",
    "\n",
    "        return dot + user_bias + item_bias\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dot_net = DotModel(total_user_id,total_item_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "user_id_tensor = torch.from_numpy(np.array([1,2,3,1]))\n",
    "item_id_tensor = torch.from_numpy(np.array([1,3,4,1]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dot_net(user_id_tensor,item_id_tensor)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import imp\n",
    "import numpy as np\n",
    "\n",
    "import torch.optim as optim\n",
    "\n",
    "import torch_utils; imp.reload(torch_utils)\n",
    "from torch_utils import gpu, minibatch, shuffle, regression_loss\n",
    "\n",
    "class FactorizationModel(object):\n",
    "    \n",
    "    def __init__(self,\n",
    "                 embedding_dim=32,\n",
    "                 n_iter=10,\n",
    "                 batch_size=256,\n",
    "                 l2=0.0,\n",
    "                 learning_rate=1e-2,\n",
    "                 use_cuda=False,\n",
    "                 net=None,\n",
    "                 num_users=None,\n",
    "                 num_items=None, \n",
    "                 random_state=None):\n",
    "        \n",
    "        self._embedding_dim = embedding_dim\n",
    "        self._n_iter = n_iter\n",
    "        self._learning_rate = learning_rate\n",
    "        self._batch_size = batch_size\n",
    "        self._l2 = l2\n",
    "        self._use_cuda = use_cuda\n",
    "        \n",
    "        self._num_users = num_users\n",
    "        self._num_items = num_items\n",
    "        self._net = net\n",
    "        self._optimizer = None\n",
    "        self._loss_func = None\n",
    "        self._random_state = random_state or np.random.RandomState()\n",
    "             \n",
    "        \n",
    "    def _initialize(self):\n",
    "        if self._net is None:\n",
    "            self._net = gpu(DotModel(self._num_users, self._num_items, self._embedding_dim),self._use_cuda)\n",
    "        \n",
    "        self._optimizer = optim.Adam(\n",
    "                self._net.parameters(),\n",
    "                lr=self._learning_rate,\n",
    "                weight_decay=self._l2\n",
    "            )\n",
    "        \n",
    "        self._loss_func = regression_loss\n",
    "    \n",
    "    @property\n",
    "    def _initialized(self):\n",
    "        return self._optimizer is not None\n",
    "    \n",
    "    def __repr__(self):\n",
    "        return _repr_model(self)\n",
    "    \n",
    "    def fit(self, user_ids, item_ids, ratings, verbose=True):\n",
    "        \n",
    "        user_ids = user_ids.astype(np.int64)\n",
    "        item_ids = item_ids.astype(np.int64)\n",
    "        \n",
    "        if not self._initialized:\n",
    "            self._initialize()\n",
    "            \n",
    "        for epoch_num in range(self._n_iter):\n",
    "            users, items, ratingss = shuffle(user_ids,\n",
    "                                            item_ids,\n",
    "                                            ratings)\n",
    "\n",
    "            user_ids_tensor = gpu(torch.from_numpy(users),\n",
    "                                  self._use_cuda)\n",
    "            item_ids_tensor = gpu(torch.from_numpy(items),\n",
    "                                  self._use_cuda)\n",
    "            ratings_tensor = gpu(torch.from_numpy(ratingss),\n",
    "                                 self._use_cuda)\n",
    "            epoch_loss = 0.0\n",
    "\n",
    "            for (minibatch_num,\n",
    "                 (batch_user,\n",
    "                  batch_item,\n",
    "                  batch_rating)) in enumerate(minibatch(self._batch_size,\n",
    "                                                         user_ids_tensor,\n",
    "                                                         item_ids_tensor,\n",
    "                                                         ratings_tensor)):\n",
    "                \n",
    "                \n",
    "                # to be completed...\n",
    "                predictions = \n",
    "                #\n",
    "                loss = \n",
    "                epoch_loss = \n",
    "                #\n",
    "                #\n",
    "                \n",
    "            \n",
    "            epoch_loss = epoch_loss / (minibatch_num + 1)\n",
    "\n",
    "            if verbose:\n",
    "                print('Epoch {}: loss {}'.format(epoch_num, epoch_loss))\n",
    "        \n",
    "            if np.isnan(epoch_loss) or epoch_loss == 0.0:\n",
    "                raise ValueError('Degenerate epoch loss: {}'\n",
    "                                 .format(epoch_loss))\n",
    "    \n",
    "    \n",
    "    def test(self,user_ids, item_ids, ratings):\n",
    "        self._net.train(False)\n",
    "        user_ids = user_ids.astype(np.int64)\n",
    "        item_ids = item_ids.astype(np.int64)\n",
    "        \n",
    "        user_ids_tensor = gpu(torch.from_numpy(user_ids),\n",
    "                                  self._use_cuda)\n",
    "        item_ids_tensor = gpu(torch.from_numpy(item_ids),\n",
    "                                  self._use_cuda)\n",
    "        ratings_tensor = gpu(torch.from_numpy(ratings),\n",
    "                                 self._use_cuda)\n",
    "               \n",
    "        predictions = self._net(user_ids_tensor, item_ids_tensor)\n",
    "        \n",
    "        loss = self._loss_func(ratings_tensor, predictions)\n",
    "        return loss.data.item()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model = FactorizationModel(embedding_dim=128,  # latent dimensionality\n",
    "                                   n_iter=10,  # number of epochs of training\n",
    "                                   batch_size=1024,  # minibatch size\n",
    "                                   learning_rate=1e-3,\n",
    "                                   l2=1e-9,  # strength of L2 regularization\n",
    "                                   use_cuda=torch.cuda.is_available(),\n",
    "                                   num_users=total_user_id+1,\n",
    "                                   num_items=total_item_id+1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "user_ids_train_np = user_id_train.as_matrix().astype(np.int32)\n",
    "item_ids_train_np = item_id_train.as_matrix().astype(np.int32)\n",
    "ratings_train_np = rating_train.as_matrix().astype(np.float32)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.fit(user_ids_train_np, item_ids_train_np, ratings_train_np)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(model._net)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "user_ids_test_np = user_id_test.as_matrix().astype(np.int64)\n",
    "item_ids_test_np = item_id_test.as_matrix().astype(np.int64)\n",
    "ratings_test_np = rating_test.as_matrix().astype(np.float32)\n",
    "model.test(user_ids_test_np, item_ids_test_np, ratings_test_np  )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks like we are already overfitting..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Analysing and interpreting the results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "user_emb_np = model._net.user_embeddings.weight.data.numpy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "item_emb_np = model._net.item_embeddings.weight.data.numpy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[How to Use t-SNE Effectively](https://distill.pub/2016/misread-tsne/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.manifold import TSNE\n",
    "\n",
    "item_tsne = TSNE(perplexity=30).fit_transform(item_emb_np)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.figure(figsize=(10, 10))\n",
    "plt.scatter(item_tsne[:, 0], item_tsne[:, 1]);\n",
    "plt.xticks(()); plt.yticks(());\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Getting the name of the movies (there must be a better way, please provide alternate solutions!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(op.join(ML_100K_FOLDER, 'u.item'), sep='|', names=['item_id', 'item_name','date','','','','','','','','','','','','','','','','','','','','',''],encoding = \"ISO-8859-1\")\n",
    "movies_names = df.loc[:,['item_id', 'item_name']]\n",
    "movies_names = movies_names.set_index(['item_id'])\n",
    "movies_names.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "item_bias_np = model._net.item_biases.weight.data.numpy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "movies_names['biases'] = pd.Series(item_bias_np[1:].T[0], index=movies_names.index)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "movies_names.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "movies_names.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "indices_item_train = np.sort(item_id_train.unique())\n",
    "movies_names = movies_names.loc[indices_item_train]\n",
    "movies_names.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "movies_names = movies_names.sort_values(ascending=False,by=['biases'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Best movies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "movies_names.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Worse movies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "movies_names.tail(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. SPOTLIGHT\n",
    "\n",
    "The code written above is a simplified version of [SPOTLIGHT](https://github.com/maciejkula/spotlight)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you installed it with: `conda install -c maciejkula -c pytorch spotlight=0.1.5`, you can compare the results..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from spotlight.datasets.movielens import get_movielens_dataset\n",
    "\n",
    "dataset = get_movielens_dataset(variant='100K')\n",
    "\n",
    "from spotlight.cross_validation import random_train_test_split\n",
    "\n",
    "train, test = random_train_test_split(dataset, random_state=np.random.RandomState(42))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model = FactorizationModel(embedding_dim=128,  # latent dimensionality\n",
    "                                   n_iter=10,  # number of epochs of training\n",
    "                                   batch_size=1024,  # minibatch size\n",
    "                                   learning_rate=1e-3,\n",
    "                                   l2=1e-9,  # strength of L2 regularization\n",
    "                                   use_cuda=torch.cuda.is_available(),\n",
    "                                   num_users=total_user_id+1,\n",
    "                                   num_items=total_item_id+1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.fit(train.user_ids,train.item_ids,train.ratings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "from spotlight.factorization.explicit import ExplicitFactorizationModel\n",
    "\n",
    "model_spot = ExplicitFactorizationModel(loss='regression',\n",
    "                                   embedding_dim=128,  # latent dimensionality\n",
    "                                   n_iter=10,  # number of epochs of training\n",
    "                                   batch_size=1024,  # minibatch size\n",
    "                                   l2=1e-9,  # strength of L2 regularization\n",
    "                                   learning_rate=1e-3,\n",
    "                                   use_cuda=torch.cuda.is_available())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_spot.fit(train, verbose=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "item_emb_spot_np = model_spot._net.item_embeddings.weight.data.numpy()\n",
    "item_bias_spot_np = model_spot._net.item_biases.weight.data.numpy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "movies_names = df.loc[:,['item_id', 'item_name']]\n",
    "movies_names = movies_names.set_index(['item_id'])\n",
    "movies_names['biases_S'] = pd.Series(item_bias_spot_np[1:].T[0], index=movies_names.index)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "indices_item_train = np.sort(item_id_train.unique())\n",
    "movies_names = movies_names.loc[indices_item_train]\n",
    "movies_names.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "movies_names = movies_names.sort_values(ascending=False,by=['biases_S'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "movies_names.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "movies_names.tail(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Further Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "pca = PCA(n_components=3)\n",
    "movie_pca = pca.fit(item_emb_np.T).components_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "fac0 = movie_pca[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "item_comp = [(f, movies_names.loc[i,'item_name']) for f,i in zip(fac0, indices_item_train)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import operator;\n",
    "sorted(item_comp, key=operator.itemgetter(0))[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sorted(item_comp, key=operator.itemgetter(0), reverse=True)[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "fac1 = movie_pca[1]\n",
    "item_comp = [(f, movies_names.loc[i,'item_name']) for f,i in zip(fac1, indices_item_train)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sorted(item_comp, key=operator.itemgetter(0))[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sorted(item_comp, key=operator.itemgetter(0), reverse=True)[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "fac2 = movie_pca[2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start=50; end=100\n",
    "X = fac0[start:end]\n",
    "Y = fac2[start:end]\n",
    "plt.figure(figsize=(15,15))\n",
    "plt.scatter(X, Y)\n",
    "for i, x, y in zip(indices_item_train[start:end], X, Y):\n",
    "    plt.text(x,y,movies_names.loc[i,'item_name'], color=np.random.rand(3)*0.7, fontsize=14)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise\n",
    "\n",
    "Previous analysis is not very convincing. Redo it with the dataset ml-latest-small.zip\n",
    "\n",
    "This dataset contains much more movies and less users. Each user rated at least 20 movies.\n",
    "\n",
    "To make the analysis of the factors more interesting, restrict it to the top 2000 most popular movies.\n",
    "\n",
    "You should obtain something like:\n",
    "\n",
    "![PCA movies](pca_movies.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 7. Neural Net"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import models; imp.reload(models)\n",
    "from models import DeepModel\n",
    "import ExplicitFactorizationModel; imp.reload(ExplicitFactorizationModel)\n",
    "from ExplicitFactorizationModel import ExplicitFactorizationModel\n",
    "from torch_utils import l1_loss\n",
    "\n",
    "model = ExplicitFactorizationModel(embedding_dim=128,  # latent dimensionality\n",
    "                                   n_iter=10,  # number of epochs of training\n",
    "                                   batch_size=1024,  # minibatch size\n",
    "                                   learning_rate=1e-3,\n",
    "                                   l2=1e-9,  # strength of L2 regularization\n",
    "                                   use_cuda=torch.cuda.is_available(),\n",
    "                                   num_users=total_user_id+1,\n",
    "                                   num_items=total_item_id+1,\n",
    "                                  loss = l1_loss)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.fit(user_ids_train_np, item_ids_train_np, ratings_train_np,\n",
    "          user_ids_test_np, item_ids_test_np, ratings_test_np)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model_deep = ExplicitFactorizationModel(embedding_dim=128,  # latent dimensionality\n",
    "                                   n_iter=10,  # number of epochs of training\n",
    "                                   batch_size=1024,  # minibatch size\n",
    "                                   learning_rate=1e-4,\n",
    "                                   l2=1e-9,  # strength of L2 regularization\n",
    "                                   use_cuda=torch.cuda.is_available(),\n",
    "                                   num_users=total_user_id+1,\n",
    "                                   num_items=total_item_id+1,\n",
    "                                   net=DeepModel(total_user_id+1,total_item_id+1,128),\n",
    "                                       loss=l1_loss)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_deep.fit(user_ids_train_np, item_ids_train_np, ratings_train_np,\n",
    "          user_ids_test_np, item_ids_test_np, ratings_test_np)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import mean_squared_error\n",
    "from sklearn.metrics import mean_absolute_error\n",
    "\n",
    "test_preds = model.predict(user_ids_test_np, item_ids_test_np)\n",
    "print(\"Final test MSE: %0.3f\" % mean_squared_error(test_preds, rating_test))\n",
    "print(\"Final test MAE: %0.3f\" % mean_absolute_error(test_preds, rating_test))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_preds_deep = model_deep.predict(user_ids_test_np, item_ids_test_np)\n",
    "print(\"Final test MSE: %0.3f\" % mean_squared_error(test_preds_deep, rating_test))\n",
    "print(\"Final test MAE: %0.3f\" % mean_absolute_error(test_preds_deep, rating_test))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(model_deep._net)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Exercise\n",
    "\n",
    "Add another layer and compare."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Finding most similar items\n",
    "Finding k most similar items to a point in embedding space\n",
    "\n",
    "- Write in numpy a function to compute the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between two points in embedding space\n",
    "- Write a function which computes the euclidean distance between a point in embedding space and all other points\n",
    "- Write a most similar function, which returns the k item names with lowest euclidean distance / highest cosine similarity.\n",
    "- Try with a movie index, such as 181 (Return of the Jedi). What do you observe? Don't expect miracles on such a small training set but still give your results on the forum!\n",
    "\n",
    "Notes:\n",
    "- you may use `np.linalg.norm` to compute the norm of vector, and you may specify the `axis=`\n",
    "- the numpy function `np.argsort(...)` enables to compute the sorted indices of a vector"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:pytorch]",
   "language": "python",
   "name": "conda-env-pytorch-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
